Towards Building a Corpus-based Dictionary for Non-word-boundary Languages
Abstract
Corpus-based lexicography is an effective approach to building dictionaries for languages that exhibit explicit word boundaries. However, for non-word-boundary languages such as Japanese, Chinese and Thai, it is an arduous job. Because these languages have no clear criteria for what counts as a word, the most difficult task in building a corpus-based dictionary for them is selecting the word list, or lexicon entries. We propose a practical solution for this task by applying the C4.5 learning algorithm to build the lexicon list. Applied to Thai corpora, our algorithm yields promising results of about 85% on both the training and test corpora.
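The abstract does not say which features are fed to C4.5, so the following is only a minimal sketch of the split criterion that C4.5 is known for (gain ratio), applied to a toy word/non-word classification. The attribute names `freq` and `length`, the sample rows, and the labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting on a discrete attribute (C4.5's criterion)."""
    total = len(rows)
    # Group the class labels by the value each row takes for `attr`.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many small partitions.
    split_info = -sum((len(p) / total) * math.log2(len(p) / total)
                      for p in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical candidate strings described by two made-up attributes.
rows = [
    {"freq": "high", "length": "short"},
    {"freq": "high", "length": "long"},
    {"freq": "low", "length": "short"},
    {"freq": "low", "length": "long"},
]
labels = ["word", "word", "nonword", "nonword"]

# The attribute with the highest gain ratio would be chosen for the root split.
best_attr = max(["freq", "length"], key=lambda a: gain_ratio(rows, labels, a))
```

In this toy data `freq` separates the two classes perfectly while `length` carries no information, so a tree built with this criterion would split on `freq` first. A full C4.5 implementation also handles continuous attributes, missing values, and pruning, none of which is shown here.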
Similar resources
Automated Building of Sentence-Level Parallel Corpus and Chinese-Hungarian Dictionary
Decades of work have been conducted on automated building of parallel corpus and bilingual dictionary in the field of natural language processing. However, rarely have any studies been done between high-density character-based languages and medium-density word-based languages due to the lack of resources and fundamental linguistic differences. In this paper, we describe a methodology for creati...
Building Bilingual Corpus based on Hybrid Approach for Myanmar-English Machine Translation
Word alignment in bilingual corpora has been an active research topic in the Machine Translation research groups. In this paper, we describe an alignment system that aligns English-Myanmar texts at word level in parallel sentences. Essential for building parallel corpora is the alignment of translated segments with source segments. Since word alignment research on Myanmar and English languages ...
Non-Dictionary-Based Thai Word Segmentation Using Decision Trees
For languages without word boundary delimiters, dictionaries are needed for segmenting running texts. This figure makes segmentation accuracy depend significantly on the quality of the dictionary used for analysis. If the dictionary is not sufficiently good, it will lead to a great number of unknown or unrecognized words. These unrecognized words certainly reduce segmentation accuracy. To solve...
A Phrase-Boundary Translation Model Using Shallow Syntactic Labels
Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...
Synchronizing Translated Movie Subtitles
This paper addresses the problem of synchronizing movie subtitles, which is necessary to improve alignment quality when building a parallel corpus out of translated subtitles. In particular, synchronization is done on the basis of aligned anchor points. Previous studies have shown that cognate filters are useful for the identification of such points. However, this restricts the approach to rela...